{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "有了ChatGPT的帮助，就可以信心十足地涉略一些之前没经验地领域了。\n",
    "这次的保险欺诈识别任务，数据来源于Kaggle的竞赛题目：https://www.kaggle.com/competitions/fraud-detection-in-insurance-claims\n",
    "建模语言使用Python3，并使用Jupyter来组织给GPT的Promote和代码，完成后可以直接导出为markdown。\n",
    "接下来是数据分析和建模的过程，为了缩短篇幅，GPT的回答部分只保留了代码。\n",
    "\n",
    "问：你现在是一个数据建模人员，要对保险反欺诈进行预测。\n",
    "数据包括：\n",
    "1）训练集：train.csv （40个字段，有fraud_reported字段）\n",
    "2）测试集：test.csv （39个字段，没有fraud_reported字\n",
    "段）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "作为数据建模人员，对保险反欺诈进行预测是一个有趣的任务。我可以帮助你制定一个基本的建模流程。下面是一个通用的步骤：\n",
    "\n",
    "1. 导入必要的库和加载数据集：\n",
    "   - 使用适合数据分析和建模的库，如Pandas、NumPy和Scikit-learn。\n",
    "   - 使用Pandas库的`read_csv`函数加载训练集和测试集数据。\n",
    "\n",
    "2. 数据探索和预处理：\n",
    "   - 使用Pandas库的各种函数和方法来了解数据集的结构、特征和分布。\n",
    "   - 检查缺失值和异常值，并根据需要进行处理。\n",
    "   - 进行特征工程，包括特征选择、特征变换、创建新特征等。\n",
    "\n",
    "3. 划分训练集和验证集：\n",
    "   - 将训练集数据划分为用于训练模型的子集和用于验证模型性能的子集。\n",
    "   - 这可以通过Scikit-learn库中的`train_test_split`函数实现。\n",
    "\n",
    "4. 选择模型和训练：\n",
    "   - 根据问题的特点选择适当的分类模型，如逻辑回归、决策树、随机森林等。\n",
    "   - 使用训练集上的数据对所选模型进行训练。\n",
    "\n",
    "5. 模型评估：\n",
    "   - 使用验证集上的数据评估模型的性能。\n",
    "   - 常见的评估指标包括准确率、精确率、召回率、F1分数等。\n",
    "\n",
    "6. 模型优化和调参：\n",
    "   - 根据模型性能对模型进行优化和调参，以改善预测效果。\n",
    "   - 可以使用交叉验证、网格搜索等技术来找到最佳参数组合。\n",
    "\n",
    "7. 对测试集进行预测：\n",
    "   - 使用训练好的模型对测试集进行预测，并生成预测结果。\n",
    "\n",
    "请注意，上述步骤是一个通用的流程示例，具体实现可能因数据集的特点和建模方法的选择而有所不同。另外，数据预处理、特征工程和模型选择等环节也可能需要根据具体情况进行进一步的调整和优化。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先需要你对数据进行加载，分别得到 train和test。给我对应的Python代码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# 加载训练集和测试集数据\n",
    "train_data = pd.read_csv('train.csv')\n",
    "test_data = pd.read_csv('test.csv')\n",
    "\n",
    "# 查看训练集数据的前几行\n",
    "print(\"训练集数据：\")\n",
    "print(train_data.head())\n",
    "\n",
    "# 查看测试集数据的前几行\n",
    "print(\"测试集数据：\")\n",
    "print(test_data.head())\n",
    "\n",
    "# 检查数据集的维度\n",
    "print(\"训练集数据维度：\", train_data.shape)\n",
    "print(\"测试集数据维度：\", test_data.shape)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对train_data和test_data进行合并，生成data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 合并训练集和测试集\n",
    "data = pd.concat([train_data, test_data], ignore_index=True)\n",
    "\n",
    "# 查看合并后的数据框\n",
    "print(\"合并后的数据：\")\n",
    "print(data.head())\n",
    "\n",
    "# 检查合并后的数据框维度\n",
    "print(\"合并后的数据维度：\", data.shape)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对data进行数据探索，包括：缺失值，字段的唯一值个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 检查每个字段的缺失值数量\n",
    "missing_values = data.isnull().sum()\n",
    "print(\"字段的缺失值数量：\")\n",
    "print(missing_values)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 计算每个字段的唯一值个数\n",
    "unique_value_counts = data.nunique()\n",
    "print(\"字段的唯一值个数：\")\n",
    "print(unique_value_counts)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在data中，去掉没有的字段 '_c39'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 去除字段 '_c39'\n",
    "data = data.drop('_c39', axis=1)\n",
    "\n",
    "# 检查更新后的数据框\n",
    "print(\"更新后的数据：\")\n",
    "print(data.head())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "筛选data中的字符串字段，赋值给cat_columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 筛选出字符串字段\n",
    "cat_columns = data.select_dtypes(include=['object'])\n",
    "\n",
    "# 查看筛选出的字符串字段\n",
    "print(\"字符串字段：\")\n",
    "print(cat_columns.head())\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对data中的这些字符串字段，计算唯一值个数，并按照唯一值个数从大到小排序"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 计算字符串字段的唯一值个数\n",
    "unique_counts = data.select_dtypes(include=['object']).nunique()\n",
    "\n",
    "# 按照唯一值个数从大到小排序\n",
    "unique_counts_sorted = unique_counts.sort_values(ascending=False)\n",
    "\n",
    "# 查看排序后的唯一值个数\n",
    "print(\"按照唯一值个数排序后：\")\n",
    "print(unique_counts_sorted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "data中的property_damage字段有3种值，分别为：NO, YES 和 ?，分别将他们映射成0, 1, 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将三种值映射为 0, 1, 2\n",
    "data['property_damage'] = data['property_damage'].replace({\"NO\": 0, \"YES\": 1, \"?\": 2})\n",
    "\n",
    "# 查看更新后的数据框\n",
    "print(\"更新后的数据：\")\n",
    "print(data['property_damage'].head())\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对data中的police_report_available字段有3种值，分别为：NO, YES 和 ?，分别将他们映射成0, 1, 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将三种值映射为 0, 1, 2\n",
    "data['police_report_available'] = data['police_report_available'].replace({\"NO\": 0, \"YES\": 1, \"?\": 2})\n",
    "\n",
    "# 查看更新后的数据框\n",
    "print(\"更新后的数据：\")\n",
    "print(data['police_report_available'].head())\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "data中有两个字段是日期类型，分别为policy_bind_date, incident_date，为确保他们是日期类型字段，需要将他们格式进行转换"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将字段转换为日期类型\n",
    "data['policy_bind_date'] = pd.to_datetime(data['policy_bind_date'])\n",
    "data['incident_date'] = pd.to_datetime(data['incident_date'])\n",
    "\n",
    "# 查看更新后的数据类型\n",
    "print(\"更新后的数据类型：\")\n",
    "print(data[['policy_bind_date', 'incident_date']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 找到 policy_bind_date 的最小日期\n",
    "min_policy_bind_date = data['policy_bind_date'].min()\n",
    "# 计算每个记录的 policy_bind_date_diff\n",
    "data['policy_bind_date_diff'] = (data['policy_bind_date'] - min_policy_bind_date).dt.days\n",
    "\n",
    "# 查看更新后的数据\n",
    "print(\"更新后的数据：\")\n",
    "print(data[['policy_bind_date', 'policy_bind_date_diff']].head())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同理，在data中创建一个新的字段incident_date_diff，它等于 (incident_date - policy_bind_date中日期最小值)的天数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# 计算每个记录的 incident_date_diff\n",
    "data['incident_date_diff'] = (data['incident_date'] - min_policy_bind_date).dt.days\n",
    "\n",
    "# 查看更新后的数据\n",
    "print(\"更新后的数据：\")\n",
    "print(data[['incident_date', 'incident_date_diff']].head())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在data中创建一个新的字段incident_date_policy_bind_date_diff，它等于incident_date_diff - policy_bind_date_diff"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 创建新的字段 incident_date_policy_bind_date_diff\n",
    "data['incident_date_policy_bind_date_diff'] = data['incident_date_diff'] - data['policy_bind_date_diff']\n",
    "\n",
    "# 查看更新后的数据\n",
    "print(\"更新后的数据：\")\n",
    "print(data[['incident_date_diff', 'policy_bind_date_diff', 'incident_date_policy_bind_date_diff']].head())\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在data中去掉原始的日期字段 policy_bind_date和incident_date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 去掉原始的日期字段\n",
    "data = data.drop(['policy_bind_date', 'incident_date'], axis=1)\n",
    "\n",
    "# 查看更新后的数据框\n",
    "print(\"更新后的数据：\")\n",
    "print(data.head())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在data中有字符串类型的字段，需要对他们进行标签编码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# 重新筛选出字符串字段\n",
    "data1=data;\n",
    "cat_columns = data1.select_dtypes(include=['object'])\n",
    "\n",
    "# 创建 LabelEncoder 对象\n",
    "label_encoder = LabelEncoder()\n",
    "\n",
    "\n",
    "# 遍历字符串字段，进行标签编码\n",
    "for column in cat_columns:\n",
    "    data1[column] = label_encoder.fit_transform(data1[column])\n",
    "\n",
    "# 查看更新后的数据\n",
    "print(\"更新后的数据：\")\n",
    "print(data1.head())\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "之前data是由train_data和test_data合并的，现在再将data拆分回train_data和test_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 根据之前的索引或行数拆分回 train_data 和 test_data\n",
    "train_data = data1[:len(train_data)]\n",
    "test_data = data1[len(train_data):]\n",
    "\n",
    "# 查看拆分后的数据\n",
    "print(\"拆分后的 train_data：\")\n",
    "print(train_data.shape)\n",
    "\n",
    "print(\"拆分后的 test_data：\")\n",
    "print(test_data.shape)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在想要用LightGBM进行建模，对train_data进行训练，对test_data进行预测，其中 fraud_reported是目标字段，预测结果可以是概率值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "# 定义特征和目标字段\n",
    "features = train_data.drop('fraud_reported', axis=1)\n",
    "target = train_data['fraud_reported']\n",
    "\n",
    "# 创建 LightGBM 数据集\n",
    "train_dataset = lgb.Dataset(features, label=target)\n",
    "\n",
    "# 定义 LightGBM 参数\n",
    "params = {\n",
    "    'objective': 'binary',\n",
    "    'metric': 'binary_logloss',\n",
    "    'boosting_type': 'gbdt',\n",
    "    'num_leaves': 31,\n",
    "    'learning_rate': 0.05,\n",
    "    'feature_fraction': 0.9,\n",
    "    'bagging_fraction': 0.8,\n",
    "    'bagging_freq': 5,\n",
    "    'verbose': 0\n",
    "}\n",
    "\n",
    "# 在训练集上训练 LightGBM 模型\n",
    "model = lgb.train(params, train_dataset, num_boost_round=100)\n",
    "\n",
    "# 对测试集进行预测\n",
    "test_features = test_data.drop('fraud_reported', axis=1)\n",
    "test_pred = model.predict(test_features)\n",
    "\n",
    "# 输出预测结果（概率值）\n",
    "print(\"预测结果：\")\n",
    "print(test_pred)\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "读取sampleSubmission.csv，将fraud_reported字段设置为刚才得到的 test_pred，并保存为result_baseline.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取 sampleSubmission.csv 文件\n",
    "submission = pd.read_csv(\"sampleSubmission.csv\")\n",
    "\n",
    "# 将 fraud_reported 字段设置为预测结果 test_pred\n",
    "submission['fraud_reported'] = test_pred\n",
    "\n",
    "# 保存结果为 result_baseline.csv 文件\n",
    "submission.to_csv(\"result_baseline.csv\", index=False)\n",
    "\n",
    "# 打印保存成功的提示信息\n",
    "print(\"结果已保存为 result_baseline.csv 文件\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "_chatgpt",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.2"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "43ac9d14f2d410d90a4db262ca50ec1d295fd4b9d4bd541fb6f3f5897a3ca96d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
